Power management in computers, why?

• To lower the power consumption of data centers;
• To increase the battery life of mobile computers;
• To have quieter and slimmer devices.


Reverse engineering power management, why?

Power management is:
• at least partially assisted by software;
• almost entirely undocumented;
• often considered to be a manufacturer secret;
• thus poorly studied and implemented in open drivers.

This is especially true in the GPU world.
General overview of ways to save power

• Origin of the power consumption
• Usual ways of saving power
• Areas of application
Power consumption of a logic gate

$P = P_{static} + P_{dynamic}$

$P_{static}$: Small transistors leak current even when "blocked"
• $P_{static} = V \times I_{leak}$;
• $I_{leak}$ depends on the voltage and the etching of the transistors.

$P_{dynamic}$: Fighting the gate capacitance when switching
• $P_{dynamic} = C f V^2$;
• $C$: Capacitance of the gate (fixed);
• $f$: Frequency at which the gate is switched;
• $V$: Voltage at which the gate is powered.
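
As a worked illustration of the dynamic-power formula, the derivation below compares two operating points of the same gate. The numbers (halving the frequency, lowering the voltage from 1.0 V to 0.8 V) are hypothetical and only meant to show why lowering the voltage together with the clock pays off more than lowering the clock alone.

```latex
\frac{P'_{dynamic}}{P_{dynamic}}
  = \frac{C \cdot \frac{f}{2} \cdot (0.8\,V)^2}{C \cdot f \cdot V^2}
  = 0.5 \times 0.64
  = 0.32
```

Under these assumptions the dynamic power drops by 68%, whereas halving the frequency at constant voltage would only save 50%.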
The dynamic and static power cost

Usual ways of saving power

• Clock gating: Cuts the dynamic-power cost;
• Power gating: Cuts all the power cost;
• Reclocking: Adjusts the clock frequency and voltage.


Clock gating: Stopping the clock of unused gates
• Update rate: Every clock cycle;
• Effectiveness: Cuts the dynamic-power cost entirely;
• Drawbacks: Increases the complexity of the clock tree;
• Executed by: Hardware.

Power gating
• Update rate: Around a microsecond;
• Effectiveness: Cuts the power cost entirely;
• Drawbacks: May need to save the context before shutdown;
• Executed by: Hardware and/or software.

Reclocking
• Update rate: Around a millisecond;
• Effectiveness: Impacts the static- and dynamic-power cost;
• Drawbacks: Affects performance;
• Executed by: Software.
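
The three techniques can be compared on a toy model of a single block, using the static and dynamic power formulas given earlier. This is a minimal sketch: the `Block` class, its field names, and the numeric values are assumptions made for illustration, and the leak current is held constant even though it really depends on the voltage.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Block:
    """Hypothetical hardware block described by its electrical parameters."""
    capacitance: float   # farads
    frequency: float     # hertz
    voltage: float       # volts
    leak_current: float  # amperes (assumed constant here)
    clock_enabled: bool = True
    powered: bool = True


def power(block):
    """Total power in watts: P = V * I_leak + C * f * V^2."""
    if not block.powered:
        return 0.0  # power gating removes both static and dynamic cost
    static = block.voltage * block.leak_current
    dynamic = 0.0
    if block.clock_enabled:  # clock gating removes only the dynamic cost
        dynamic = block.capacitance * block.frequency * block.voltage ** 2
    return static + dynamic


def clock_gate(block):
    return replace(block, clock_enabled=False)


def power_gate(block):
    return replace(block, powered=False)


def reclock(block, frequency, voltage):
    return replace(block, frequency=frequency, voltage=voltage)


base = Block(capacitance=1e-9, frequency=1e9, voltage=1.0, leak_current=0.1)
for name, variant in [
    ("baseline", base),
    ("clock gated", clock_gate(base)),
    ("power gated", power_gate(base)),
    ("reclocked", reclock(base, frequency=0.5e9, voltage=0.8)),
]:
    print(f"{name:12s} {power(variant):.2f} W")
```

The sketch ignores the drawbacks listed above, such as the context save needed before power gating and the performance lost when reclocking.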
• Find the bottleneck using performance counters;
• Lower the clocks of all the other clock domains;
• Lower the voltage of the power domains based on clocks;
• Increase the clock of the bottleneck clock domain;
• Repeat and learn about application patterns.

Difficulties:
• finding the bottleneck fast enough;
• predicting the needed voltage based on the clocks' frequencies;
• calculating the memory timings on the fly;
• supporting any combination of clocks.
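
One iteration of the loop described above could look like the following sketch. Everything here is hypothetical: the domain names, the fixed 50 MHz step, the clock limits, and in particular the linear voltage model, which stands in for the hard problem of predicting the needed voltage from the clock frequencies.

```python
def find_bottleneck(loads):
    """Return the clock domain with the highest load (0.0 to 1.0)."""
    return max(loads, key=loads.get)


def dvfs_step(clocks, loads, limits, step=50):
    """Raise the bottleneck's clock and lower every other clock by `step` MHz.

    clocks: domain -> current frequency in MHz
    loads:  domain -> load read from performance counters
    limits: domain -> (min MHz, max MHz)
    """
    bottleneck = find_bottleneck(loads)
    new_clocks = {}
    for domain, mhz in clocks.items():
        low, high = limits[domain]
        delta = step if domain == bottleneck else -step
        new_clocks[domain] = min(high, max(low, mhz + delta))
    return bottleneck, new_clocks


def voltage_for(clocks, limits, v_min=0.9, v_max=1.1):
    """Assumed model: voltage follows the most demanding clock domain."""
    demand = max(
        (clocks[d] - limits[d][0]) / (limits[d][1] - limits[d][0])
        for d in clocks
    )
    return v_min + demand * (v_max - v_min)


clocks = {"core": 400, "shader": 800, "memory": 600}
loads = {"core": 0.35, "shader": 0.95, "memory": 0.50}
limits = {"core": (200, 600), "shader": (400, 1200), "memory": (300, 900)}

bottleneck, clocks = dvfs_step(clocks, loads, limits)
print(bottleneck, clocks, round(voltage_for(clocks, limits), 4))
```

A real implementation would also have to recompute memory timings for each new memory clock and order the voltage and clock changes so that the voltage is never too low for the clocks in use.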
A simple clock domain's clock tree

Figure: Clock tree for the core clock domain on nv84

Places to apply the proposed solutions
• card-level power gating (Optimus);
• internal engines;
• VGA DACs;
• PCIe port (ASPM);
• anything using a clock and being part of a power domain.
PCIe ASPM impact

Figure: Maximum power consumption of the PCIe port at various link configurations.
Summary

• PCOUNTER

PCOUNTER: Overview


Performance counters
• are blocks in modern processors that monitor their activity;
• count hardware events such as cache hits/misses;
• are tied to a clock domain;
• provide the load information needed for DVFS decision making.
Figure: Example of a simple performance counter
PCOUNTER: Overview of a domain

Figure: Schematic view of a domain from PCOUNTER. For each counter, multiplexers select input signals, a truth table combines them into a macro signal, and the resulting events are counted against the domain's clock.
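
The counting path of such a domain (multiplexers, truth table, macro signal, event counter) can be modelled in a few lines. This is only a sketch: the signal names, the number of inputs, and the encoding of the truth table as a bit mask are assumptions made for the example, not the documented register layout.

```python
def macro_signal(truth_table, inputs):
    """Combine the selected signals into one bit using a truth table.

    truth_table: integer bit mask; bit i holds the output for input pattern i
    inputs:      tuple of 0/1 values, inputs[0] being the least significant bit
    """
    index = 0
    for bit, value in enumerate(inputs):
        index |= (value & 1) << bit
    return (truth_table >> index) & 1


def count_events(samples, selected, truth_table):
    """Count the clock cycles during which the macro signal is high.

    samples:  one dict per clock cycle, mapping signal name -> 0/1
    selected: names of the signals picked by the multiplexers
    """
    events = 0
    for cycle in samples:
        inputs = tuple(cycle[name] for name in selected)
        events += macro_signal(truth_table, inputs)
    return events, len(samples)


# Hypothetical signals: count cycles where the engine is busy and not stalled.
selected = ("busy", "stalled")
truth_table = 0b0010  # only pattern busy=1, stalled=0 (index 1) counts
samples = [
    {"busy": 1, "stalled": 0},
    {"busy": 1, "stalled": 1},
    {"busy": 0, "stalled": 0},
    {"busy": 1, "stalled": 0},
]
events, cycles = count_events(samples, selected, truth_table)
print(f"load = {events}/{cycles} = {events / cycles:.2f}")
```

Dividing the event count by the cycle count gives the kind of load figure a DVFS policy can use.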
MP counters
• per-channel/process counters in PGRAPH;
• same logic as PCOUNTER;
• require running an OpenCL kernel to read them;
• share some in-engine multiplexers with PCOUNTER.

PDAEMON
• 4 global counters;
• very simplified logic;
• usually about the busyness of the other engines.
PCOUNTER signals
• very chipset-dependent;
• about 150 signals reverse engineered on nv50;
• thanks to Marcin Koscielnicki (mwk) and Samuel Pitoiset (GSoC 2013).

MP counters signals
• all GPGPU signals exported by CUPTI on Fermi+ reversed;
• thanks to Christoph Bumiller (calim) and Samuel Pitoiset.

PDAEMON's signals
• 5 signals known;
• thanks to Marcin Koscielnicki (mwk).
Summary

• PTHERM
  • Thermal management
  • FSRM
  • Power regulation

PTHERM: Thermal management

PTHERM's thermal management
• sends IRQs to the host when reaching temperature thresholds;
• can cut the power of the card through a GPIO;
• can force the fan to the maximum speed;
• can lower the frequency of the main engine of the GPU (through FSRM).
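
The threshold-driven behaviour listed above can be pictured as a table mapping temperatures to actions. The sketch below is purely illustrative: the threshold values, the action names, and their ordering are hypothetical and do not come from the hardware.

```python
# Hypothetical thresholds in degrees Celsius; real values are board-specific.
THRESHOLDS = {
    "irq_to_host": 80,
    "fsrm_downclock": 90,
    "fan_max": 95,
    "cut_power_gpio": 105,
}


def thermal_actions(temperature, thresholds=THRESHOLDS):
    """Return the actions whose threshold has been reached, mildest first."""
    ordered = sorted(thresholds.items(), key=lambda item: item[1])
    return [action for action, limit in ordered if temperature >= limit]


for temperature in (70, 92, 110):
    print(temperature, thermal_actions(temperature))
```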
Frequency-Switching Ratio Modulation (FSRM)

• is used to lower the frequency of the main engine of the GPU;
• is useful to lower the temperature or the power consumption;
• is triggered automatically when reaching thresholds.

How can the FSRM lower power consumption?

• A divided clock is generated from the main engine's clock;
• The clock must be divided by a power of two (2 to 16);
• The FSRM can generate any clock frequency between the original clock and the divided clock;
• With a lower clock, an engine consumes less power.
PTHERM: Frequency-Switching Ratio Modulation

Figure: Frequency of the core clock (original @ 408 MHz) when using a 16-divider and varying the FSRM
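
A simple way to reason about the resulting frequency is to treat it as a time-weighted average of the original and divided clocks. The sketch below assumes that linear model; the parameter `full_speed_fraction` is a hypothetical stand-in for the FSRM ratio, whose actual register encoding is not described here.

```python
def fsrm_frequency(base_mhz, divider, full_speed_fraction):
    """Average frequency when switching between a clock and its divided version.

    base_mhz:            original clock frequency in MHz
    divider:             power-of-two divider, from 2 to 16
    full_speed_fraction: share of time spent on the original clock (0.0 to 1.0)
    """
    if divider not in (2, 4, 8, 16):
        raise ValueError("divider must be a power of two between 2 and 16")
    if not 0.0 <= full_speed_fraction <= 1.0:
        raise ValueError("full_speed_fraction must be between 0.0 and 1.0")
    divided_mhz = base_mhz / divider
    return (full_speed_fraction * base_mhz
            + (1.0 - full_speed_fraction) * divided_mhz)


# Core clock at 408 MHz with a 16-divider, as in the figure above.
for fraction in (0.0, 0.5, 1.0):
    print(fraction, fsrm_frequency(408, 16, fraction))
```

With these assumptions the reachable range is 25.5 MHz to 408 MHz, and any frequency in between is obtained by varying the fraction.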
PTHERM: Power estimation

Calculating the power consumption

PTHERM estimates power consumption by:
• reading every block's activity (in use or not);
• summing the weighted activity signals of the blocks;
• applying a low-pass filter.
Figure: Extract of NVIDIA's patent on power estimation (US8060765)
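
The three estimation steps (read activity, weighted sum, low-pass filter) can be sketched as follows. The block names other than PGRAPH, the weights, and the choice of an exponential moving average as the low-pass filter are all assumptions made for illustration; the hardware's actual filter and weights are not given here.

```python
def weighted_activity(activity, weights):
    """Sum the weights of the blocks that are currently in use."""
    return sum(weight for block, weight in weights.items() if activity[block])


def low_pass(samples, alpha=0.5):
    """Exponential moving average, used here as a simple low-pass filter."""
    filtered = []
    value = None
    for sample in samples:
        value = sample if value is None else value + alpha * (sample - value)
        filtered.append(value)
    return filtered


# Hypothetical per-block weights, in watts drawn when the block is active.
weights = {"pgraph": 40.0, "block_b": 10.0, "block_c": 5.0}
activity_log = [
    {"pgraph": True, "block_b": False, "block_c": False},
    {"pgraph": True, "block_b": True, "block_c": True},
    {"pgraph": False, "block_b": False, "block_c": False},
]

raw = [weighted_activity(activity, weights) for activity in activity_log]
print(raw, low_pass(raw))
```

The filter smooths out short bursts of activity so that a limiter reacting to the estimate does not oscillate on every sample.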
Power limitation

PTHERM's power limitation can
• read the power consumption by counting the active blocks;
• update the FSRM ratio to stay in the power budget;
• use two hysteresis windows for altering the FSRM ratio;
• do all that automatically.

Figure: Example of the power limiter in the dual window mode
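
One plausible reading of the dual window mode is that a narrow window triggers small corrections of the FSRM ratio and a wider window triggers large ones. The sketch below implements that interpretation; the window widths, the step sizes, the 0 to 255 ratio range, and the convention that a higher ratio means a faster clock are all assumptions, not documented behaviour.

```python
def update_fsrm_ratio(ratio, power, budget,
                      inner=5.0, outer=15.0, small_step=1, large_step=8,
                      ratio_min=0, ratio_max=255):
    """Adjust a hypothetical FSRM ratio to keep `power` near `budget` (watts).

    Inside the inner window nothing changes (hysteresis). Between the inner
    and outer windows the ratio moves by `small_step`; beyond the outer
    window it moves by `large_step`.
    """
    error = power - budget
    if error > outer:
        step = -large_step   # far above the budget: slow down quickly
    elif error > inner:
        step = -small_step   # slightly above the budget: slow down gently
    elif error < -outer:
        step = large_step    # far below the budget: speed up quickly
    elif error < -inner:
        step = small_step    # slightly below the budget: speed up gently
    else:
        step = 0
    return min(ratio_max, max(ratio_min, ratio + step))


ratio = 200
for measured in (100.0, 108.0, 120.0, 94.0, 80.0):
    ratio = update_fsrm_ratio(ratio, measured, budget=100.0)
    print(measured, ratio)
```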
Power limitation: NVIDIA's actual implementation

• NVIDIA doesn't use PTHERM to implement power limitation;
• It may read the power consumption from the voltage controller and downclock the card when exceeding the budget.
Figure: Effect of disabling the power limiter on the GeForce GTX 580. Copyrights to Wizzard from techpowerup.com.
Summary

• PDAEMON

PDAEMON
• is an RTOS embedded in every new NVIDIA GPU (Fermi+);
• is clocked at 200 MHz and is programmed in the FuC ISA;
• has access to all the registers of the card;
• can catch all the interrupts from the GPU to the host;
• features internal performance counters.

NVIDIA's usage of PDAEMON
• Fan management;
• Hardware scheduling (for memory reclocking);
• Power gating and power budget enforcement;
• Performance and system monitoring.
Summary

• Conclusion
  • Conclusion & Future work

Conclusion

The GPU as an autonomic system

The GPU can:
• self-configure: thanks to PDAEMON, which can act as a driver;
• self-optimise: using the performance counters;
• self-heal: recovering from over-temperature/current;
• self-protect: GPU users are isolated in separate virtual memory (VM) address spaces.

Future work
• Implement stable reclocking across all GPUs;
• Write a test-bed for implementations of DVFS algorithms;
• Document clock- and power-gating details;
• Reverse engineer more performance-counter signals.
Questions & Discussions